Private Set Generation with Discriminative Information
Differentially private data generation techniques have become a promising
solution to the data privacy challenge: they enable sharing of data while
complying with rigorous privacy guarantees, which is essential for scientific
progress in sensitive domains. Unfortunately, restricted by the inherent
complexity of modeling high-dimensional distributions, existing private
generative models struggle with the utility of synthetic samples.
In contrast to existing works that aim at fitting the complete data
distribution, we directly optimize for a small set of samples that are
representative of the distribution under the supervision of discriminative
information from downstream tasks, which is generally an easier task and more
suitable for private training. Our work provides an alternative view for
differentially private generation of high-dimensional data and introduces a
simple yet effective method that greatly improves the sample utility of
state-of-the-art approaches. Comment: NeurIPS 2022, 19 pages
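The idea of directly optimizing a small representative set under private supervision can be illustrated with a toy sketch. This is not the authors' implementation: it uses a numpy logistic-regression stand-in, gradient matching as the optimization signal, and invented 2-D blob data and constants. The private model sees the sensitive data only through clipped, noised (Gaussian-mechanism) gradients, and two trainable synthetic points are updated to reproduce that gradient signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def per_sample_grads(w, X, y):
    # logistic-regression per-sample gradients: (sigmoid(x.w) - y) * x
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X

def private_mean_grad(w, X, y, clip=1.0, sigma=1.0):
    # Gaussian mechanism: clip each per-sample gradient, sum, add noise, average
    g = per_sample_grads(w, X, y)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g / np.maximum(1.0, norms / clip)
    return (g.sum(axis=0) + rng.normal(0.0, sigma * clip, size=g.shape[1])) / len(X)

# sensitive "real" data: two Gaussian blobs (a stand-in for the private dataset)
X_real = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y_real = np.repeat([0.0, 1.0], 100)

# a tiny synthetic set, optimized directly: one trainable sample per class
X_syn = rng.normal(0.0, 0.1, (2, 2))
y_syn = np.array([0.0, 1.0])

w = np.zeros(2)
for _ in range(300):
    g_priv = private_mean_grad(w, X_real, y_real)        # only DP access to real data
    p = 1.0 / (1.0 + np.exp(-X_syn @ w))
    g_syn = ((p - y_syn)[:, None] * X_syn).mean(axis=0)
    diff = g_syn - g_priv                                # gradient-matching residual
    for i in range(len(X_syn)):                          # descend ||g_syn - g_priv||^2 in x
        grad_x = ((p[i] - y_syn[i]) * diff
                  + p[i] * (1 - p[i]) * (X_syn[i] @ diff) * w) / len(X_syn)
        X_syn[i] -= 0.3 * grad_x
    w -= 0.3 * g_priv                                    # model trains on private gradients
```

In this toy run the two synthetic points drift toward their respective class regions, so the small set ends up carrying the discriminative information needed for the downstream task, without the model ever touching un-noised real gradients.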
Data Forensics in Diffusion Models: A Systematic Analysis of Membership Privacy
In recent years, diffusion models have achieved tremendous success in the
field of image generation, becoming the state-of-the-art technology for AI-based
image processing applications. Despite the numerous benefits brought by recent
advances in diffusion models, there are also concerns about their potential
misuse, specifically in terms of privacy breaches and intellectual property
infringement. In particular, some of their unique characteristics open up new
attack surfaces when considering the real-world deployment of such models. With
a thorough investigation of the attack vectors, we develop a systematic
analysis of membership inference attacks on diffusion models and propose novel
attack methods tailored to each attack scenario specifically relevant to
diffusion models. Our approach exploits easily obtainable quantities and is
highly effective, achieving near-perfect attack performance (>0.9 AUC-ROC) in
realistic scenarios. Our extensive experiments demonstrate the effectiveness of
our method, highlighting the importance of considering privacy and intellectual
property risks when using diffusion models in image generation tasks.
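One plausible reading of the "easily obtainable quantities" is the model's one-step denoising error, which training members tend to incur less of. The toy simulation below is not the paper's attack: the noise schedule, the stand-in eps_model, and the memorisation gap (members' errors scaled down by 0.7) are all invented for illustration. It shows how such a quantity becomes a thresholding attack, summarised by the AUC of the error score.

```python
import numpy as np

rng = np.random.default_rng(1)

def denoising_error(x, t, eps_model):
    # one-step denoising loss ||eps - eps_theta(x_t, t)||^2 at a fixed timestep t,
    # computable with query access to the model
    alpha_bar = np.exp(-t)                               # toy noise schedule
    eps = rng.normal(size=x.shape)
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2, axis=1)

members = rng.normal(0.0, 1.0, (500, 8))
nonmembers = rng.normal(0.0, 1.0, (500, 8))
eps_model = lambda x_t, t: rng.normal(0.0, 0.6, size=x_t.shape)  # stand-in predictor

# simulate the memorisation gap: the model denoises its training points better
err_in = 0.7 * denoising_error(members, 0.5, eps_model)
err_out = denoising_error(nonmembers, 0.5, eps_model)

# threshold attack: flag "member" when the error falls below a threshold;
# the pairwise comparison below is the AUC of the error as a membership score
auc = (err_in[:, None] < err_out[None, :]).mean()
```

Even this crude score separates the two populations; the paper's tailored attacks exploit the structure of diffusion training far more effectively to reach the reported >0.9 AUC-ROC.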
GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models
Deep learning has achieved overwhelming success, spanning from discriminative
models to generative models. In particular, deep generative models have
facilitated a new level of performance in a myriad of areas, ranging from media
manipulation to sanitized dataset generation. Despite the great success, the
potential risks of privacy breach caused by generative models have not been
analyzed systematically. In this paper, we focus on membership inference attack
against deep generative models that reveals information about the training data
used for victim models. Specifically, we present the first taxonomy of
membership inference attacks, encompassing not only existing attacks but also
our novel ones. In addition, we propose the first generic attack model that can
be instantiated in a large range of settings and is applicable to various kinds
of deep generative models. Moreover, we provide a theoretically grounded attack
calibration technique, which consistently boosts the attack performance in all
cases, across different attack settings, data modalities, and training
configurations. We complement the systematic analysis of attack performance by
a comprehensive experimental study, that investigates the effectiveness of
various attacks w.r.t. model type and training configurations, over three
diverse application scenarios (i.e., images, medical data, and location data). Comment: CCS 2020, 20 pages
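The calibration idea can be sketched as follows, assuming (as one simple instantiation) a nearest-generated-sample reconstruction distance as the membership score: subtracting the same distance computed under a reference model trained on disjoint public data removes the contribution of a sample being generically hard to generate. The setup below is a synthetic toy with invented names and an artificially leaky victim, not the paper's attack model.

```python
import numpy as np

rng = np.random.default_rng(2)

def reconstruction_distance(x, gen_samples):
    # attacker proxy for the victim's density at x: distance to the
    # nearest sample drawn from the generator
    d = np.linalg.norm(gen_samples[None, :, :] - x[:, None, :], axis=2)
    return d.min(axis=1)

# toy setup: the victim generator overfits its members; the reference
# generator is trained on disjoint public data and captures only the
# generic population density
members = rng.normal(0.0, 1.0, (200, 4))
nonmembers = rng.normal(0.0, 1.0, (200, 4))
victim_samples = np.vstack([members + rng.normal(0.0, 0.2, members.shape),
                            rng.normal(0.0, 1.0, (200, 4))])   # leaks members
reference_samples = rng.normal(0.0, 1.0, (400, 4))             # no leakage

def calibrated_score(x):
    # calibration: subtract the reference distance so that samples that are
    # unlikely under ANY model are not mistaken for non-members
    return (reconstruction_distance(x, victim_samples)
            - reconstruction_distance(x, reference_samples))

s_in, s_out = calibrated_score(members), calibrated_score(nonmembers)
auc = (s_in[:, None] < s_out[None, :]).mean()   # lower score => member
```

The subtraction is what the paper grounds theoretically; here it simply centres the non-member scores around zero while member scores stay negative, which is why it helps uniformly across settings.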
Fed-GLOSS-DP: Federated, Global Learning using Synthetic Sets with Record Level Differential Privacy
This work proposes Fed-GLOSS-DP, a novel privacy-preserving approach for
federated learning. Unlike previous linear point-wise gradient-sharing schemes,
such as FedAvg, our formulation enables a type of global optimization by
leveraging synthetic samples received from clients. These synthetic samples,
serving as loss surrogates, approximate local loss landscapes by simulating the
utility of real images within a local region. We additionally introduce an
approach to measure effective approximation regions reflecting the quality of
the approximation. As a result, the server can recover the global loss landscape
and comprehensively optimize the model. Moreover, motivated by the emerging
privacy concerns, we demonstrate that our approach seamlessly works with
record-level differential privacy (DP), granting theoretical privacy guarantees
for every data record on the clients. Extensive results validate the efficacy
of our formulation on various datasets with highly skewed distributions. Our
method consistently improves over the baselines, especially under highly
skewed distributions and the noisy gradients introduced by DP. The source
code will be released upon publication.
Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
Limited data access is a longstanding barrier to data-driven research and
development in the networked systems community. In this work, we explore if and
how generative adversarial networks (GANs) can be used to incentivize data
sharing by enabling a generic framework for sharing synthetic datasets with
minimal expert knowledge. As a specific target, our focus in this paper is on
time series datasets with metadata (e.g., packet loss rate measurements with
corresponding ISPs). We identify key challenges of existing GAN approaches for
such workloads with respect to fidelity (e.g., long-term dependencies, complex
multidimensional relationships, mode collapse) and privacy (i.e., existing
guarantees are poorly understood and can sacrifice fidelity). To improve
fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate
that across diverse real-world datasets (e.g., bandwidth measurements, cluster
requests, web sessions) and use cases (e.g., structural characterization,
predictive modeling, algorithm comparison), DG achieves up to 43% better
fidelity than baseline models. Although we do not resolve the privacy problem
in this work, we identify fundamental challenges with both classical notions of
privacy and recent advances to improve the privacy properties of GANs, and
suggest a potential roadmap for addressing these challenges. By shedding light
on the promise and challenges, we hope our work can rekindle the conversation
on workflows for data sharing. Comment: Published in IMC 2020. 20 pages, 26 figures
RelaxLoss: Defending Membership Inference Attacks without Losing Utility
As a long-term threat to the privacy of training data, membership inference attacks (MIAs) arise ubiquitously across machine learning models.
Existing works evidence a strong connection between the distinguishability of the training and testing loss distributions and the model's vulnerability to MIAs. Motivated by these results, we propose a novel training framework based on a relaxed loss (RelaxLoss) with a more achievable learning target, which leads to a narrowed generalization gap and reduced privacy leakage. RelaxLoss is applicable to any classification model, with the added benefits of easy implementation and negligible overhead. Through extensive evaluations on five datasets with diverse modalities (images, medical data, transaction records), our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs as well as model utility. Our defense is the first that can withstand a wide range of attacks while preserving (or even improving) the target model's utility.
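A minimal sketch of the relaxed-target idea, assuming (as one plausible instantiation) that training takes a gradient-ascent step whenever the loss is already below the target alpha, so the training loss hovers near alpha instead of collapsing toward zero; the model, data, and constants are toy stand-ins, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

def cross_entropy(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

# toy classification data: two Gaussian blobs in 5 dimensions
X = np.vstack([rng.normal(-1.0, 1.0, (100, 5)), rng.normal(1.0, 1.0, (100, 5))])
y = np.repeat([0.0, 1.0], 100)

alpha = 0.35                  # relaxed target: do not push training loss below this
w = np.zeros(5)
for _ in range(500):
    g = grad(w, X, y)
    if cross_entropy(w, X, y) > alpha:
        w -= 0.2 * g          # ordinary descent while above the target
    else:
        w += 0.2 * g          # ascent step keeps the training loss near alpha,
                              # narrowing the gap to the (higher) test loss
final_loss = cross_entropy(w, X, y)
```

Because the training loss settles near alpha rather than near zero, the training and testing loss distributions become harder to distinguish, which is exactly the vulnerability signal that threshold-style MIAs exploit.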
GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators
The widespread availability of rich data has fueled the growth of machine learning applications in numerous domains. However, growth in domains with highly sensitive data (e.g., medical) is largely hindered, as the private nature of the data prohibits it from being shared. To this end, we propose Gradient-sanitized Wasserstein Generative Adversarial Networks (GS-WGAN), which allow releasing a sanitized form of the sensitive data with rigorous privacy guarantees. In contrast to prior work, our approach is able to distort gradient information more precisely, thereby enabling the training of deeper models that generate more informative samples. Moreover, our formulation naturally allows for training GANs in both centralized and federated (i.e., decentralized) data scenarios. Through extensive experiments, we find that our approach consistently outperforms state-of-the-art approaches across multiple metrics (e.g., sample quality) and datasets.
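The sanitization step itself follows the standard Gaussian-mechanism recipe: clip each per-example gradient to a fixed norm, then add calibrated Gaussian noise. The function below is a generic, hedged stand-in for where GS-WGAN would sanitize the gradient signal passed from the (private) discriminator back to the generator; the shapes and constants are invented.

```python
import numpy as np

rng = np.random.default_rng(4)

def sanitize(grad_batch, clip=1.0, sigma=1.0):
    # per-example clipping: rescale any gradient with norm > clip down to clip,
    # bounding each record's influence (the DP sensitivity)
    norms = np.linalg.norm(grad_batch, axis=1, keepdims=True)
    clipped = grad_batch / np.maximum(1.0, norms / clip)
    # Gaussian mechanism: noise scaled to the clipping bound
    noise = rng.normal(0.0, sigma * clip, size=grad_batch.shape)
    return clipped + noise

grads = rng.normal(0.0, 5.0, (64, 16))       # raw per-example gradients (toy)
san = sanitize(grads, clip=1.0, sigma=0.5)   # what the generator would receive
```

Sanitizing only this single gradient vector, rather than every parameter gradient of a large network, is what lets the distortion be applied more precisely and deeper models be trained.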
Responsible Disclosure of Generative Models Using Scalable Fingerprinting
Over the past seven years, deep generative models have achieved a qualitatively new level of performance. Generated data has become difficult, if not impossible, to distinguish from real data. While there are plenty of use cases that benefit from this technology, there are also strong concerns about how this new technology can be misused to spoof sensors, generate deep fakes, and enable misinformation at scale. Unfortunately, current deep fake detection methods are not sustainable, as the gap between real and fake continues to close. In contrast, our work enables a responsible disclosure of such state-of-the-art generative models, which allows model inventors to fingerprint their models so that the generated samples containing a fingerprint can be accurately detected and attributed to a source. Our technique achieves this through an efficient and scalable ad-hoc generation of a large population of models with distinct fingerprints. Our recommended operation point uses a 128-bit fingerprint, which in principle results in more than 10^36 identifiable models. Experiments show that our method fulfills key properties of a fingerprinting mechanism and achieves effectiveness in deep fake detection and attribution.
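The capacity claim is simple arithmetic: each fingerprint bit doubles the number of distinguishable models, so a 128-bit code indexes 2^128 distinct models, comfortably above the quoted 10^36.

```python
# capacity of a 128-bit fingerprint: each extra bit doubles the number of
# distinguishable models, so the code space holds 2**128 distinct fingerprints
n_models = 2 ** 128
print(n_models)             # 340282366920938463463374607431768211456 (~3.4 * 10**38)
print(n_models > 10 ** 36)  # True: "more than 10^36 identifiable models"
```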
Privacy considerations for sharing genomics data
An increasing amount of attention has been geared towards understanding the privacy risks that arise from sharing genomic data of human origin. Most of these efforts have focused on issues in the context of genomic sequence data, but the popularity of techniques for collecting other types of genome-related data has prompted researchers to investigate privacy concerns in a broader genomic context. In this review, we give an overview of different types of genome-associated data, their individual ways of revealing sensitive information, the motivation to share them, as well as established and upcoming methods to minimize information leakage. We further discuss the concrete threats that are posed, who is at risk, and how the risk level compares to potential benefits, all while addressing the topic in the context of modern technology, methodology, and information-sharing culture. Additionally, we discuss the current legal situation regarding the sharing of genomic data in a selection of countries, evaluating the scope of their applicability as well as their limitations. We conclude this review by evaluating the developments required in the scientific field in the near future to improve privacy-preserving data-sharing techniques for the genomic context.